Recent developments in Machine Translation: a review of the last five years

Author

  • W. John Hutchins
Abstract

multilevel tree representations which combine syntactic, logical and semantic relationships), lexical transfer (lexical substitution with some structural changes), structural transfer (tree transduction), syntactic generation, morphological generation (trees to strings). The long-term aim of the GETA project is a multilingual system producing 'good enough' results, i.e. accepting the need for post-editing. Like Eurotra, the system is essentially linguistics-oriented; it does not claim to use any 'deep understanding' or 'intelligence', and hence no AI-type explicit 'expertise' is incorporated in GETA-ARIANE, although the possibility of grafting on an 'expert' error correction mechanism was investigated by Boitet and Gerber (1986). However, unlike other linguistics-based systems, Ariane extends translation analysis to sequences of several sentences or paragraphs, in order to deal with problems of anaphora and tense/aspect agreement. For practical production the system permits optional pre-editing, primarily the marking of lexical ambiguities; post-editing can be done using the REVISION program developed for ARIANE-78. It is a mainframe batch system with no human interaction during processing. However, Zajac (1986) has investigated an interactive analysis module for GETA, somewhat on the lines of Tomita's research at Carnegie-Mellon (Tomita 1986). One important development has been the refinement of the theoretical basis, particularly the clarification of the distinction and the relationship between dynamic and static grammars in the system. Static grammars (or SCSGs, 'structural correspondence static grammars') record the correspondences between NL strings and their equivalent interface structures in a formalism which is neutral with respect to analysis and synthesis.
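The stage sequence listed above (lexical transfer, structural transfer, generation from trees to strings) can be pictured as a chain of small functions over a tree. The Python sketch below is purely illustrative and is not drawn from the GETA software: the `Node` type, the toy French-English lexicon, and the adjective-reordering rule are all assumptions invented here, and syntactic generation is folded into the structural step for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical interface-structure node (not an ARIANE data type)."""
    label: str                      # phrase category or part of speech
    lemma: str = ""                 # lexical unit, for leaves only
    number: str = "sg"              # toy grammatical feature
    children: List["Node"] = field(default_factory=list)


# Assumed toy bilingual lexicon (French lemma -> English lemma).
LEXICON = {"le": "the", "index": "mark", "noir": "black"}


def lexical_transfer(node: Node) -> Node:
    """Substitute TL lemmas for SL lemmas, leaving the structure untouched."""
    return Node(node.label, LEXICON.get(node.lemma, node.lemma), node.number,
                [lexical_transfer(c) for c in node.children])


def structural_transfer(node: Node) -> Node:
    """Tree transduction: place adjectives before the noun inside an NP."""
    kids = [structural_transfer(c) for c in node.children]
    if node.label == "NP":
        others = [k for k in kids if k.label not in ("ADJ", "N")]
        adjs = [k for k in kids if k.label == "ADJ"]
        nouns = [k for k in kids if k.label == "N"]
        kids = others + adjs + nouns
    return Node(node.label, node.lemma, node.number, kids)


def morphological_generation(node: Node) -> str:
    """Trees to strings: inflect the leaves and concatenate them."""
    if not node.children:
        if node.label == "N" and node.number == "pl":
            return node.lemma + "s"
        return node.lemma
    return " ".join(morphological_generation(c) for c in node.children)


def translate(tree: Node) -> str:
    return morphological_generation(structural_transfer(lexical_transfer(tree)))


# Assumed analysis result for a plural noun phrase with a post-nominal adjective.
source_tree = Node("NP", number="pl", children=[
    Node("DET", "le", "pl"),
    Node("N", "index", "pl"),
    Node("ADJ", "noir", "pl"),
])
print(translate(source_tree))  # the black marks
```

Keeping each stage as a separate function over the same tree type mirrors the modular design described in the text, where each stage can be developed and tested on its own.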
The processes of analysis and generation are handled by 'dynamic grammars' written in appropriate 'special languages' (SLLPs or Special Languages for Linguistic Programming): ATEF for morphological analysis, ROBRA for structural analysis, structural transfer and syntactic generation, EXPANS for lexical transfer, and SYGMOR for morphological generation. (The distinction between 'static' and 'dynamic' grammars is now found in many advanced transfer systems; the GETA project has been a leading force in this theoretical development.) Equally important have been the improvements to the research environment, in tools for the development of systems, such as ATLAS for lexicographic work and VISULEX for viewing complex dictionary entries. Such tools are components of a 'linguistic workstation' for MT research (an idea also being developed by the Saarbrücken and the Kyoto groups; see sect. 15 and below). Within this environment the work of the Calliope project has taken place: the compilation of the static grammars for English and French during 1983-84, their corresponding dynamic grammars, and the substantial lexicographic work. The Grenoble group has always encouraged and supported other MT projects using GETA software, and thereby helped to train MT researchers. ARIANE is regarded above all as "an integrated programming environment" for the development and building of "a variety of linguistic models, in order to test the general multilingual design and the various facilities for lingware preparation..." The ARIANE software has been tested on an impressive range of languages, often in small-scale experiments (Vauquois and Boitet 1985/1988; Hutchins 1986: 247-8; Boitet 1987a), but sometimes in larger projects, e.g. the English-Malay project mentioned elsewhere in this survey. The largest GETA-ARIANE system has been for Russian-French translation, which built upon previous experience with CETA.
Since 1983 this system has been extensively and regularly tested in an experimental 'translation unit'; large corpora of text have been translated, including some 200,000 running words during one 18-month period (Boitet 1987b). Another large-scale system was the German-French system developed by Guilbaud and Stahl, using the same generator programs as in the Russian-French system. Its principal features were the attention given to morphological derivation and inflection, and the restriction of structural analysis almost wholly to morphological and syntactic data, with little or no use of semantic information. The system has been described by Guilbaud (1984/1987), but there has been little development of the system since 1984 (Boitet 1987b). The most important practical application of a large-scale system has, however, been through GETA's involvement in the French national computer-assisted translation project (NCATP). Launched in November 1983 (after a preparatory stage in 1982-83, the ESOPE project), the Calliope project has been financed 50% from public funds (administered by the Agence d'Informatique) and 50% from private sources. One source has been B'VITAL, founded in 1984 by the Grenoble group, which is responsible for the machine-readable dictionaries and for the 'static grammars' (Joscelyne 1987). Another has been Sonovision, which was to provide the aeronautics terminology for the major French-English system Calliope-Aero. After a demonstration of a prototype of Calliope-Aero at Expolangues in February 1986, it was decided to develop also an English-French system for the translation of computer science and data processing materials, Calliope-Info. In addition to these MT systems, both batch systems, the project was also to produce a translator's workstation (Calliope-Revision, organised around a Bull Questar 400 microcomputer) for preparing and post-editing texts and for access to remote term banks, including OCR and desk-top publishing facilities.
This was essential if the systems were to be fully integrated into an industrial documentation environment. However, given the expected delays, there have been plans by SG2 (one of the backers) to develop a terminology aid with split-screen word processing, Calliope-Manuel. Whatever the commercial feasibility of the Calliope project, which came to a formal end in February 1987 (Boitet 1987a), the experience will no doubt be put to good use by the GETA project, in particular the experience of dealing with complex dictionaries and the type of scientific and technical sublanguage presented by aeronautics. Boitet (1986), for example, mentions the successful treatment of complex noun phrases (e.g. la jonction bloc frein et raccord de tuyauterie) and complex adjectival phrases (e.g. comprise entre les deux index noir). Other problems did not occur in the sublanguage and thus were put aside, e.g. interrogatives, relative clauses introduced by dont, imperatives, certain comparatives, nominal groups which do not consist only of nouns, and so forth. The NCATP has had other consequences. It stimulated the conversion of ARIANE-85 to run on the IBM PC AT (with a minimum 20MB hard disk), adequate for MT development but not for a production system. It also encouraged the writing of new software in a French dialect of LISP (Boitet 1986; Boitet 1987a, 1987b), with the aim of creating a fully multilingual system with a single 'special language' for processing strings and trees (TETHYS). Clearly, GETA has continued to advance the boundaries of MT research.

14. While GETA is the main MT research centre in France, there are other MT projects in Nancy and Poitiers. At Nancy, Chauché (1986; Rolf & Chauché 1986) continues his research, begun at Grenoble, on algorithms for tree manipulation which are suitable for MT systems. Tests of the algorithms have been applied to Spanish-French and Dutch-French experiments (in collaboration with Rolf of Nijmegen University).
From Poitiers, Poesco (1986) reports a small-scale knowledge-based MT experiment for translating Rumanian texts on three-dimensional geometry into French. The ATN parser produces a conceptual frame-slot representation from which the generator devises a 'plan' for producing TL output. The restricted language system TITUS, designed for multilingual treatment of abstracts in the textile industry, has expanded in its latest version TITUS IV (Ducrot 1985) in order to deal with a wider range of subjects and to allow somewhat freer expression of contents. As elsewhere, there is commercial interest in translators’ workstations: Cap Sogeti Innovations is proposing a "language engineering workshop", providing 'intelligent' language tools, a dedicated multilingual word processor, a natural language knowledge base, a technical summary writer, and a 'text analyzer' which will produce abstract meaning representations. Details are necessarily vague at present (Joscelyne 1987). Attitudes to MT in France are most likely to be changed by the provision of MT services on Minitel. The availability of Systran has already been mentioned (sect. 1 above). Other services include a number of dictionaries and term banks: the Harrap French and English slang dictionary, the Dictionary of Industries, Normaterm (the term bank of the French standards organisation AFNOR), the DAICADIF lexicon for telecommunications, and (next year) FRANTEXT, the historical dictionary Trésor de la Langue Française.

15. The largest and most long-established MT group in Germany is based at Saarbrücken. It began in the mid-1960s with research on Russian-German translation, sponsored from 1972 to 1986 by the Deutsche Forschungsgemeinschaft. The SUSY project expanded into a multilingual system, based on the transfer approach, with the source languages German, Russian, English, French, and Esperanto, and the target languages German, English and French.
Detailed descriptions of the latest version SUSY II as at the end of 1984 are given by Maas (1984/1987) and by Blatt et al. (1985), and summarised by Hutchins (1986: 233-239). The most recent developments of MT research at Saarbrücken are to be found in Zimmermann et al. (1987). The most significant are the changes introduced into the basic design by the introduction of English as a SL (in the SUSY-E project), the development of explicit formalisms and software tools for testing natural language processing and MT models and for general computational linguistic experimentation (SAFRAN: Software and Formalism for the Representation of Natural Language; see Licher et al. 1987), the planned application of SUSY as the foundation of a production-oriented system (STS), the new direction of Saarbrücken MT research in the ASCOF project for French and German translation, and the involvement of SUSY personnel in the Eurotra project. The greater emphasis on product-oriented research has arisen in part from the ending of direct DFG funding in 1986. The project MARIS (Multilinguale Anwendung von Referenz-Informationssystemen) was established in mid-1985 at the University of the Saar to develop a multilingual information retrieval system, in particular to meet the needs of German-speaking users of English-language documentation (Zimmermann et al. 1987; Luckhardt 1987b). For this purpose, the MARIS team is developing a computer-assisted translation system STS (Saarbrücker Translationsservice) based on the Saarbrücken MT research. Initially only English-German will be developed, and it will be restricted to the translation of abstracts and titles of journal articles. There are three phases to the project: first, a manual system in which translators can have access to computer-based term banks; secondly, the addition of automatic lookup of terminology; and thirdly, the application of SUSY as a post-edited MT system.
The chief emphasis will be on lexicographic data and sublanguage information in the particular fields of application: housing and construction, environment, standards, and social sciences. At a later stage it is hoped to add French-German and German-French versions. The MARIS project is a natural continuation of basic MT research on SUSY, some earlier experience with a prototype translator's workstation (SUSANNAH) and the long-established research at Saarbrücken under Harald Zimmermann on information retrieval systems. The ASCOF project (Projektbereich C at Saarbrücken) grew out of research at Saarbrücken on computational methods for the analysis of the 'Archive du Français Contemporain' (established in the mid-1960s). The project team initially worked in close collaboration with GETA. However, from the mid-1970s, after the elaboration of French analysis programs for SUSY, the team established closer links with the Saarbrücken MT projects. For a while the research was using both GETA and SUSY algorithms, but since 1977 it has concentrated on SUSY-type methods, and in 1981 it emerged as an independent MT project at Saarbrücken (Scheel 1987). Its distinctive features are: the use of the COMSKEE programming language, the integration of syntactic and semantic analysis, the adoption of ATN parsers, and the use of semantic networks as a 'knowledge base' for disambiguation. ASCOF (Biewer et al. 1985/1988; Stegentritt 1987) is a system for French-German translation with a multilevel modular transfer design. Programs are written in COMSKEE (Computing and String Keeping Language), the programming environment developed at Saarbrücken. The object of ASCOF (=Analyse und Synthese des Französischen mit Comskee) is fundamental MT research, not a practical system. Analysis is in three basic phases.
The first phase of morphological analysis is followed by a second phase in three parts: disambiguation of word class homographs, identification of non-complex syntactic groups, and segmentation of sentences into independent (unrelated) parts, e.g. noun, verb and prepositional phrases. In the third phase, structural analysis is realised by a series of cascaded ATN parsers which combine syntax (e.g. functional relations) and semantics (e.g. case frames or valencies), with no priority given to either one or the other. Analysis modules operate not sequentially but interactively: thus analyses of verb phrases, complex noun phrases, complements, coordination etc. interact with each other. The integration of syntax and semantics in ASCOF analysis contrasts with procedures in SUSY and other linguistics-based transfer systems. Lexical disambiguation is achieved by reference to semantic networks which include information on synonymy, homonymy, hyponymy (e.g. whole-part, genus-species), and semantic-functional frames for verbs. As in GETA and EUROTRA, the results of analysis are not interlingual representations but SL canonical trees in which SL-specific lexical and syntactic ambiguities have been resolved. Transfer operates in a familiar way with bilingual lexical substitution and structural tree transduction, and it is followed by TL syntactic synthesis and morphological synthesis for the production of TL text output. Most of the ASCOF research effort has concentrated on problems of analysis and on testing the semantic network approach to disambiguation. Consequently the transfer and synthesis components have not yet been fully developed; there is no French synthesis program and only a small German one, so only partial implementation of translation from French into German has been possible so far (Stegentritt 1987). The system is to be tested on EEC agricultural texts, and this corpus has provided the data for the illustrative semantic networks.
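The idea of resolving a lexical ambiguity by consulting a semantic network and a verb frame can be illustrated with a very small Python sketch. It is not taken from ASCOF (which is written in COMSKEE): the concepts, the hyponymy links, the senses of the French noun "culture", and the verb frames are all hypothetical data invented here to show the general mechanism.

```python
# Hypothetical hyponymy (genus-species) links: concept -> more general concept.
IS_A = {
    "crop": "plant",
    "plant": "physical_object",
    "civilisation": "abstract_entity",
    "physical_object": "entity",
    "abstract_entity": "entity",
}

# Hypothetical senses of an ambiguous French noun.
SENSES = {"culture": ["civilisation", "crop"]}

# Hypothetical semantic-functional frames: verb -> class required of its object.
VERB_FRAMES = {"récolter": "plant", "étudier": "entity"}


def ancestors(concept):
    """Return the concept together with everything above it in the network."""
    chain = [concept]
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def disambiguate(verb, noun):
    """Keep only the noun senses that satisfy the verb's object restriction."""
    required = VERB_FRAMES[verb]
    return [sense for sense in SENSES[noun] if required in ancestors(sense)]


print(disambiguate("récolter", "culture"))  # ['crop']
print(disambiguate("étudier", "culture"))   # ['civilisation', 'crop']
```

In this toy setting a restrictive verb frame removes the ambiguity, while a permissive one leaves both senses open, which suggests why output quality would depend on how richly the networks are elaborated.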
The quality of the output is considered to be crucially dependent on the development and elaboration of the semantic networks, the linguistic 'knowledge base' of the system. ASCOF is an example of a transfer system of the third generation of MT, incorporating AI-style 'knowledge base' semantic analysis, and aiming in the long term for high quality (batch) translation.

16. ASCOF has not been the only context in which research on knowledge-based approaches to linguistic analysis has been conducted at Saarbrücken. There has been activity in text-oriented MT by Weber and Rothkegel. Weber (1986, 1987) has investigated a small-scale text-oriented system for MT. COAT (=Coherence Analysis of Texts) is a program, written in COMSKEE, which establishes text coherence, using information about valencies, arguments and roles, and produces complex representations of SL texts which might be translated into equivalent TL text representations. Analysis is not to the depth of AI-type understanding, only sufficient for translation. A speculative extension is OVERCOAT (not implemented) for establishing global text structures (paragraph sequencing) and AI-type discourse frames, and involving 'knowledge' of stereotypical situations and events. Rothkegel's (1986a, 1986b, 1987) research between 1981 and 1986 was devoted to TEXAN, a system for recognising text-linguistic features (illocution, thematization, coherence, etc.), for identifying text types (specifically EEC treaties) and consequently enabling text-specific semantic and structural analysis, disambiguation and transfer. The Saarbrücken group had collaborated with Kyoto to develop a system for translating German journal titles into Japanese. SUSY was used for the analysis of German titles and TITRAN for the generation of Japanese titles (Ammon & Wessoly 1984-85). There has subsequently been a similar project at Stuttgart to produce German translations of Japanese journal titles in the information technology field (Laubsch et al.
1984, Rosner 1986a, 1986b). The SEMSYN (Semantische Synthese) system takes as input the semantic interface representations of Japanese texts produced by Fujitsu's ATLAS/II system (cf. 30 below). ATLAS was designed for Japanese-English translation and so the semantic interface representations were not completely sufficient for German synthesis, since they gave few indications of number, definiteness, or tense. SEMSYN is a semantics-based MT system (or rather partial system); it incorporates a frame description formalism (cases, roles, modalities, scopes, purpose, part-whole relations, etc.) from which German titles are generated by reference to a restricted knowledge base of linguistic and extra-linguistic information. Like many AI-inspired systems, SEMSYN is written in LISP. The Saarbrücken group was an early participant in the Eurotra project and it has contributed a number of theoretical studies. Two recent examples are the work of Steiner (1986) on generation and of Schmidt (1986) on valency structures. However, the West German Ministry for Research and Technology (Bundesministerium für Forschung und Technologie) has also established three groups in Berlin, Bielefeld and Stuttgart to undertake theoretical research on behalf of the Eurotra project. The BMFT project as a whole is known as NASEV (Neue Analyse- und Syntheseverfahren zur maschinellen Übersetzung). At Stuttgart, Rohrer (1986a, 1986b) has been investigating the relevance of formal linguistics to the theoretical basis of MT. He advocates unification grammars (e.g. LFG, GPSG, FUG) as offering the most appropriate general frameworks for future advances in MT research. At the Technical University of Berlin, Hauenschild (1986) has been investigating AI approaches to problems of MT transfer.
This research, under the acronym KIT (Künstliche Intelligenz und Textverstehen), commenced in April 1985, and can be regarded as a continuation of her previous work on the CON3TRA project at the University of Konstanz and on the earlier SALAT project at Heidelberg. Hauenschild's model proposes (i) SL analysis in terms of a modified Generalized Phrase Structure Grammar, (ii) conversion into an intensional logic representation directly from the GPSG analysis (applying compositional semantic rules in the manner of Montague grammar), and then (iii) conversion into two levels of semantic representation: a level of 'referential nets' linking text referents, and a level of global 'text argument' structures recording intersentential relations. Transfer would operate at multiple levels: lexical (at semantic representations), sentence-semantic (at intensional logic representation, i.e. in order to preserve informational structure of SL texts), and syntactic (i.e. from 'superficial syntactic' GPSG analyses). The semantic representation language is a propositional logical formalism (with variables and operators), and includes knowledge of facts, rules and objects. The 'argument structure' representation is regarded as genuinely interlingual in so far as logical, case and argument features may be 'universal'. However, the precise division of levels is still fluid. As an MT model, Hauenschild's work represents the convergence of many recent strands of MT theory.





Publication date: 1988